smoke test allow pass for flaky providers #6638
From conversation in Discord maybe we close this and look into the gemini issue since that seems concerning (
michaelneale
left a comment
is ok for now - and will have a follow up to chase these
Hi Zane! If the gemini3 failures are with code_execution, I have #6555, which might fix the issue, as I don't see empty-response issues with it locally.
Signed-off-by: fbalicchia <[email protected]>
* origin/main:
  Fix GCP Vertex AI global endpoint support for Gemini 3 models (#6187)
  fix: macOS keychain infinite prompt loop (#6620)
  chore: reduce duplicate or unused cargo deps (#6630)
  feat: codex subscription support (#6600)
  smoke test allow pass for flaky providers (#6638)
  feat: Add built-in skill for goose documentation reference (#6534)
  Native images (#6619)
  docs: ml-based prompt injection detection (#6627)
  Strip the audience for compacting (#6646)
  chore(release): release version 1.21.0 (minor) (#6634)
  add collapsable chat nav (#6649)
  fix: capitalize Rust in CONTRIBUTING.md (#6640)
  chore(deps): bump lodash from 4.17.21 to 4.17.23 in /ui/desktop (#6623)
  Vibe mcp apps (#6569)
  Add session forking capability (#5882)
  chore(deps): bump lodash from 4.17.21 to 4.17.23 in /documentation (#6624)
  fix(docs): use named import for globby v13 (#6639)
  PR Code Review (#6043)
  fix(docs): use dynamic import for globby ESM module (#6636)

# Conflicts:
#   Cargo.lock
#   crates/goose-server/src/routes/session.rs
…o dkatz/canonical-context

* 'dkatz/canonical-provider' of github.com:block/goose: (27 commits)
  docs: add Remotion video creation tutorial (#6675)
  docs: export recipe and copy yaml (#6680)
  Test against fastmcp (#6666)
  docs: mid-session changes (#6672)
  Fix MCP elicitation deadlock and improve UX (#6650)
  chore: upgrade to rmcp 0.14.0 (#6674)
  [docs] add MCP-UI to MCP Apps blog (#6664)
  ACP get working dir from args.cwd (#6653)
  Optimise load config in UI (#6662)
  Fix GCP Vertex AI global endpoint support for Gemini 3 models (#6187)
  fix: macOS keychain infinite prompt loop (#6620)
  chore: reduce duplicate or unused cargo deps (#6630)
  feat: codex subscription support (#6600)
  smoke test allow pass for flaky providers (#6638)
  feat: Add built-in skill for goose documentation reference (#6534)
  Native images (#6619)
  docs: ml-based prompt injection detection (#6627)
  Strip the audience for compacting (#6646)
  chore(release): release version 1.21.0 (minor) (#6634)
  add collapsable chat nav (#6649)
  ...
Summary
Goose created this. Alternatively, we could just remove the experimental models for now.
Investigation Results
I investigated the flaky `smoke-tests-code-exec` job using the GitHub CLI and found:

Root Cause: Two models have inconsistent tool-calling behavior:
- `google:gemini-3-pro-preview` - Most frequent offender (~80% of failures). Sometimes returns empty responses without making any tool calls.
- `openrouter:nvidia/nemotron-3-nano-30b-a3b` - Occasional failures with similar behavior.

Pattern: When these models fail, they return nothing within ~5 seconds. When they succeed, they take ~45 seconds and properly call tools. This is typical of preview/experimental models.

Timeline:
- `gemini-3-pro-preview` added Nov 19, 2025
- `nvidia/nemotron-3-nano-30b-a3b` added Dec 31, 2025

Fix Applied
I modified `scripts/test_providers.sh` to add an "allowed failures" mechanism:
- `ALLOWED_FAILURES` array listing the flaky models
- `is_allowed_failure()` function to check if a model is in the list
- Allowed failures are reported as `⚠ FLAKY` instead of `✗ FAILED`

Expected Behavior After Fix
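For illustration only, here is a minimal bash sketch of how an allowed-failures check of this kind could work. It is not the actual contents of `scripts/test_providers.sh`. The `report_result` helper, the `HARD_FAILURES` counter, the `✓` success marker, and the `provider:model` entry format are all assumptions made for the example. It produces the two outcomes listed in this section.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; NOT the real scripts/test_providers.sh.

# Models whose failures should not fail the job.
# The "provider:model" entry format is an assumption.
ALLOWED_FAILURES=(
  "google:gemini-3-pro-preview"
  "openrouter:nvidia/nemotron-3-nano-30b-a3b"
)

# Return 0 if "provider:model" ($1) is listed in ALLOWED_FAILURES, 1 otherwise.
is_allowed_failure() {
  local candidate="$1" entry
  for entry in "${ALLOWED_FAILURES[@]}"; do
    if [[ "$entry" == "$candidate" ]]; then
      return 0
    fi
  done
  return 1
}

# Count of failures that are not on the allow list (hypothetical counter).
HARD_FAILURES=0

# report_result PROVIDER MODEL STATUS (hypothetical helper).
# STATUS is the exit code of that provider's test run; 0 means success.
report_result() {
  local provider="$1" model="$2" status="$3"
  if [[ "$status" -eq 0 ]]; then
    echo "✓ ${provider}: ${model}"
  elif is_allowed_failure "${provider}:${model}"; then
    # Flaky model: warn, but do not count it against the job.
    echo "⚠ ${provider}: ${model} (flaky)"
  else
    echo "✗ ${provider}: ${model}"
    HARD_FAILURES=$((HARD_FAILURES + 1))
  fi
}
```

In a script built this way, the job's exit code would depend only on `HARD_FAILURES`, so a failing allow-listed model produces a warning line while any other failing model still fails the run.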
- If `gemini-3-pro-preview` fails: Test shows `⚠ google: gemini-3-pro-preview (flaky)` and the job passes
- If any other model fails: Test shows `✗ provider: model` and the job fails

This approach: